Regular expressions provide a flexible means for matching strings and they are often used in data-
intensive applications. They are formally equivalent to either deterministic finite automata (DFAs)
or nondeterministic finite automata (NFAs). Both DFAs and NFAs are affected by two problems 
known as amnesia and acalculia, and DFAs are also affected by a problem known as insomnia. 
Existing techniques require an automata conversion and compaction step that prevents the use of 
existing automaton databases and hinders the maintenance of the resulting compact automata. In 
this paper, we propose Parallel Finite State Machines (PFSMs), which are able to run any DFA- 
or NFA-like state machines without a previous conversion or compaction step. PFSMs report, 
online, all the matches found within an input string and they solve the three aforementioned 
problems. Parallel Finite State Machines require quadratic time and linear memory and they 
are distributable. Parallel Finite State Machines make very fast distributed regular expression 
matching in data-intensive applications feasible. 



I. INTRODUCTION 

A regular expression, commonly called a regex, provides
a flexible means for matching strings and is often used in
data-intensive applications such as lexical analysis in
language processors, XML tokenization, bibliographic
search, database queries, spam filters [19], virus
detection, network intrusion detection, and other data
mining applications.

The implementation of regex matching relies on either 
deterministic finite automata (DFAs) or nondeterministic 
finite automata (NFAs) [9].

Both DFAs and NFAs suffer from amnesia and acalculia,
and DFAs also suffer from insomnia [11]. Amnesia is
the inability to consider the progress of multiple partial
matches. Acalculia is the inability to find subexpression
occurrences. Insomnia is the inability to unload from
memory, or set to sleep, large and rarely used portions of
automata when they are not needed.

Several techniques exist that solve those problems by
converting sets of regular expressions into compact state
machines. The required conversion or compaction step
prevents the use of existing automata that may already
be available in antivirus signature databases or
complex data filters. It also makes modifying the set of
regular expressions more difficult, as the conversion and
compaction steps have to be performed every time the
set is modified.

In this paper, we propose Parallel Finite State Machines
(PFSMs).

A Parallel Finite State Machine is an automaton that
can have multiple active states. It can be used for
efficiently finding all the matches of a set of regular
expressions in a given input string, and it solves the
amnesia and acalculia problems. It also mitigates the
effect of insomnia by reducing the number of states in the
resulting automaton.

Parallel Finite State Machines do not require conversion
and compaction steps and they can, therefore, be run on
existing DFAs or NFAs. They also allow regular
expressions to be efficiently added to or removed from
existing automata.

Parallel Finite State Machines can perform regular
expression matching in quadratic time and have linear
memory space requirements, apart from automata
storage, in the worst case. Moreover, they allow the
parallelization of the regular expression matching
process with almost linear scalability in practice.



II. BACKGROUND 

A DFA is a tuple $(\Sigma, Q, q_0, \delta, F)$, where:

• $\Sigma$ is the input alphabet (i.e. a finite, non-empty set
of symbols).

• $Q$ is a finite, non-empty set of states.

• $q_0$ is an initial state, an element of $Q$.

• $\delta$ is the state-transition function: $\delta : Q \times \Sigma \to Q$.

• $F$ is the set of final states, a subset of $Q$.

In an NFA, the state-transition function would be
$\delta : Q \times \Sigma \to \mathcal{P}(Q)$. In other words, $\delta$ returns a set of states.

NFA-like state machines can be converted into DFA-like
machines by expanding their states and removing
any nondeterminism. Each state in the DFA will
correspond to a set of states in the NFA. Although DFAs
are faster than NFAs, there can be up to $O(2^n)$ states in
a DFA, where $n$ is the number of states in the NFA.
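
The state expansion described above can be illustrated with a minimal sketch of the subset construction. The function name `nfa_to_dfa`, the dictionary encoding of the transition function, and the example NFA are assumptions made for illustration only; epsilon transitions are not handled.

```python
from itertools import chain

def nfa_to_dfa(nfa_delta, start, alphabet):
    """Subset construction for an NFA without epsilon transitions.

    nfa_delta maps (state, symbol) to a set of target states.
    Returns the reachable DFA states (frozensets of NFA states)
    and the DFA transition table.
    """
    start_set = frozenset([start])
    dfa_states = {start_set}
    dfa_delta = {}
    pending = [start_set]
    while pending:
        current = pending.pop()
        for symbol in alphabet:
            # The DFA target is the union of the NFA targets of every member.
            target = frozenset(chain.from_iterable(
                nfa_delta.get((q, symbol), ()) for q in current))
            dfa_delta[(current, symbol)] = target
            if target not in dfa_states:
                dfa_states.add(target)
                pending.append(target)
    return dfa_states, dfa_delta

# Hypothetical 4-state NFA: "the third symbol from the end is 'a'".
nfa = {
    (0, "a"): {0, 1}, (0, "b"): {0},
    (1, "a"): {2}, (1, "b"): {2},
    (2, "a"): {3}, (2, "b"): {3},
}
dfa_states, dfa_delta = nfa_to_dfa(nfa, 0, "ab")
print(len(dfa_states))  # 8 reachable DFA states for 4 NFA states
```

In this example every additional NFA state that remembers one more trailing symbol doubles the number of reachable DFA states, which is the exponential growth that the bound refers to.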

Amnesia can be inefficiently solved by using a separate
state for each combination of every potentially
simultaneous partial match [11]. This is somewhat
analogous to the approach of converting NFAs into DFAs,
but much more memory-intensive and, therefore, quite
impractical.

Acalculia can be solved by repeating the whole regex
matching process for each input string index as a starting
position and stopping at each possible ending position
[16]. This solution is time-intensive and does not allow
for the progress of partial matches to be simultaneously
considered, thus preventing the solution of amnesia.
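
As a rough illustration of why this approach is time-intensive, the following sketch tries every (start, end) pair with Python's standard `re` module. The helper name `all_matches_bruteforce` is hypothetical, and the inclusive end index is an assumption chosen to agree with the [start-end] notation used in this paper's figures.

```python
import re

def all_matches_bruteforce(pattern, text):
    """Return every (start, end) pair, end inclusive, whose substring
    is matched in full by the pattern."""
    compiled = re.compile(pattern)
    found = []
    for start in range(len(text)):
        for end in range(start, len(text)):
            # One independent matching run per candidate substring.
            if compiled.fullmatch(text, start, end + 1):
                found.append((start, end))
    return found

print(all_matches_bruteforce("a*c", "aacacab"))
```

The number of matching runs grows quadratically with the input length, and each run starts from scratch, so no progress is shared between partial matches.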

When the number of states in an automaton increases,
as a result of converting an NFA into a DFA or
considering simultaneous partial matches, memory use
skyrockets and insomnia has to be taken into
consideration. Insomnia can be solved by swapping
rarely used portions of automata out to the hard drive
when they have not been used for a while [11]. However,
that approach presents two major drawbacks. When
those portions are needed again, there might be a
swapping delay. Also, automata might be so huge that
even swapping them to the hard drive could be
impractical. The most common approach in practice
consists of avoiding the expansion of states, thus
eliminating the major cause of insomnia.

DotStar [15] solves amnesia and acalculia by converting
sets of regular expressions into compact state machines
belonging to a subclass of DFAs with status bits
associated with their states.

The technique proposed in [4] reduces the memory
requirement of DFAs by merging non-equivalent states and
labelling their input and output transitions. 

D²FAs [13] reduce the memory requirement of DFAs
by assuming default transitions for all the states. De- 
fault transitions differ from epsilon transitions in that, 
in case a state cannot find a transition for a specific in- 
put symbol, the default transition will be followed and a 
transition with that specific input symbol will be looked 
for in the target state. Several default transitions could 
be followed for processing a single symbol. 
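
The default-transition lookup described above can be sketched as follows. This is an illustrative toy, not the construction from the cited work: the function name `d2fa_step`, the two dictionaries, and the three-state automaton are assumptions, and the sketch assumes that every chain of default transitions ends in a state that has a labelled transition for every symbol.

```python
def d2fa_step(labelled, default, state, symbol):
    """Return the next state, following default transitions as needed.

    labelled[state] maps symbols to target states; default[state] is the
    state to fall back to when no labelled transition exists.
    """
    while symbol not in labelled[state]:
        state = default[state]  # default transitions consume no input
    return labelled[state][symbol]

# Hypothetical automaton over {a, b}; state 0 keeps all its transitions.
labelled = {0: {"a": 1, "b": 0}, 1: {"a": 2}, 2: {}}
default = {1: 0, 2: 1}

# From state 2, symbol 'b' needs two default transitions (2 -> 1 -> 0).
print(d2fa_step(labelled, default, 2, "b"))
```

Only four labelled transitions are stored instead of six, at the cost of extra lookups for some symbols.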

δFA is an extension of D²FA that determines some
of the results of chaining default transitions and
non-default transitions and makes them explicit in the
automaton for optimization. It also uses local memory to
store the set of transitions going out from a state that is
the source of a default transition. If the default transition
of the target state leads back to the source state, the
transition set is known without having to follow the
default transition back to the source state.

These techniques require a conversion or compaction
step that makes them unable to run on existing
uncompacted DFA- or NFA-like state machines that may
already be available in automaton databases. Moreover,
compaction prevents regular expressions from being
added to or removed from the state machine directly,
which hampers the maintainability of the automata. The
only way to add or remove regular expressions is to
convert and compact the whole new regular expression
set, which is costly in terms of time when the set of
regular expressions is complex.

Lamb [16] is a lexical analyzer that partially solves
acalculia by greedily matching every pattern starting at
every position of the input string. However, it does not
find all the possible submatches, as greedily matched
regular expressions find the longest possible match and
discard any shorter matches. That is, even though
Lamb finds matches starting at any position in the
input string, it does not find matches ending at different
positions in the input string for the same starting
position. This implies that some matches may indeed be
missed (i.e. Lamb still suffers from acalculia).
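
The effect of greedy matching can be reproduced with a small sketch based on Python's `re` module. It does not use Lamb itself; the helper `greedy_matches` and the pattern `(ac)+` are hypothetical stand-ins that only illustrate the behaviour of keeping one greedy match per starting position.

```python
import re

def greedy_matches(pattern, text):
    """Keep one greedy, non-empty match per starting position.

    Returns (start, end) pairs with an inclusive end index.
    """
    compiled = re.compile(pattern)
    found = []
    for start in range(len(text)):
        match = compiled.match(text, start)
        if match and match.end() > start:
            found.append((start, match.end() - 1))
    return found

result = greedy_matches("(ac)+", "acac")
print(result)
# "ac" at [0-1] is also a valid match of (ac)+, but it is not reported.
print((0, 1) in result, re.fullmatch("(ac)+", "ac") is not None)
```

For the starting position 0 only the longest match is kept, so the shorter match that ends at index 1 is lost.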

III. PARALLEL FINITE STATE MACHINE 

In order to solve the amnesia, acalculia, and insom- 
nia problems, we propose the use of Parallel Finite State 
Machines (PFSMs). 

A. Definition 

A PFSM is a concurrent automaton in which several 
states may be active at the same time and final states 
include associated labels. This feature enables matching 
an input string at several starting positions with a set 
of regular expressions in order to find all the possible 
matches. A PFSM generalizes all existing combinations 
of DFA- and NFA-like state machines. 

A PFSM is a tuple $(\Sigma, Q, q_0, \delta, F, L, l)$, where:

• $\Sigma$ is the input alphabet.

• $Q$ is a finite, non-empty set of states.

• $q_0$ is an initial state, an element of $Q$.

• $\delta$ is an NFA-like state-transition function: $\delta : Q \times \Sigma \to \mathcal{P}(Q)$.

• $F$ is the set of final states, a subset of $Q$.

• $L$ is the set of labels that identify the different
regular expressions.

• $l$ is the state-label function: $l : F \to L$.

A set of DFAs representing different regular expres- 
sions can be put together by using an ancillary initial 
state with epsilon transitions to each of the initial states 
of the different DFAs, and by assigning each final state 
a label that identifies the regular expression
it represents. A PFSM also allows trading off process- 
ing time for memory use by considering NFAs instead of 
DFAs for some regular expressions. 
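
A minimal sketch of this composition is given below. The function `combine`, the use of `None` as the epsilon marker, the state numbering, and the three hand-built automata are all assumptions made for illustration; the automaton in Figure 1 may be laid out differently.

```python
EPSILON = None  # assumed marker for epsilon transitions

def combine(automata):
    """Join labelled automata under an ancillary initial state 0.

    automata is a list of (label, start, transitions, finals) tuples, where
    transitions maps (state, symbol) to a set of target states and state
    names are local to each automaton.
    """
    delta = {0: {EPSILON: set()}}
    labels = {}
    next_id = 1
    for label, start, transitions, finals in automata:
        local = {start} | set(finals)
        for (source, _symbol), targets in transitions.items():
            local.add(source)
            local.update(targets)
        # Renumber local states so that automata do not clash.
        rename = {q: next_id + i for i, q in enumerate(sorted(local))}
        next_id += len(rename)
        delta[0][EPSILON].add(rename[start])
        for (source, symbol), targets in transitions.items():
            row = delta.setdefault(rename[source], {})
            row.setdefault(symbol, set()).update(rename[t] for t in targets)
        for q in finals:
            labels[rename[q]] = label
    return delta, labels

pfsm_delta, pfsm_labels = combine([
    ("a*c", 0, {(0, "a"): {0}, (0, "c"): {1}}, {1}),
    ("ac", 0, {(0, "a"): {1}, (1, "c"): {2}}, {2}),
    ("a(ca)*b", 0,
     {(0, "a"): {1}, (1, "c"): {2}, (1, "b"): {3}, (2, "a"): {1}}, {3}),
])
print(pfsm_delta[0][EPSILON], pfsm_labels)
```

Because each automaton is copied unchanged, adding a regular expression only appends states and one epsilon transition, and removing one only deletes them.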

For example, an automaton that considers the regular
expressions "a*c", "ac", and "a(ca)*b" is shown in Figure
1. When trying to match the input string "aacacab", that
automaton returns all the matches shown in Figure 2.

As the whole automaton does not need to be converted 
into a single DFA, it does not have to be expanded with 
too many states, which cuts down its memory require- 
ments, thus mitigating the impact of insomnia. 

Also, as amnesia is solved by allowing several
simultaneously active states that can represent parallel
partial matches, it is unnecessary to expand the
automaton with states representing different simultaneous
partial matches. This drastically reduces the impact of
insomnia.

Figure 1. A PFSM that compiles the "a*c", "ac", and
"a(ca)*b" regular expressions.

INPUT: "aacacab"

"a*c"      matches  [0-2]  "aac"
"a*c"      matches  [1-2]  "ac"
"ac"       matches  [1-2]  "ac"
"a*c"      matches  [2-2]  "c"
"a*c"      matches  [3-4]  "ac"
"ac"       matches  [3-4]  "ac"
"a*c"      matches  [4-4]  "c"
"a(ca)*b"  matches  [1-6]  "acacab"
"a(ca)*b"  matches  [3-6]  "acab"
"a(ca)*b"  matches  [5-6]  "ab"

Figure 2. Results of matching the "aacacab" input string using
the PFSM in Figure 1.


Furthermore, as the automaton does not need to be
compacted, regular expressions can be efficiently added
to or removed from a PFSM. This makes PFSM
maintenance easier.



B. Algorithm 

A PFSM is initialized by activating the initial state at
the start of the input string. It is also reinitialized
every time an input symbol is about to be processed, in
order to consider simultaneous partial matches starting
at different positions in the input string.

Whenever the automaton is initialized or reinitialized, 
the initial state is activated and it is annotated with a tag 
that represents the current input string position, which 
will be used when a match is completed in order to know 
the starting position of the matched substring. The tag 
will be propagated unmodified to other states when ap- 
plying transitions, and a state can include several tags at 
any given time, which will correspond to multiple simul- 
taneous matches being in the same state while starting 
at different input string positions. It should be noted 
that, whenever the automaton is reinitialized, any active 
states are kept active. 

Using a PFSM, each input symbol is processed in four
steps:

1. The automaton is initialized or reinitialized by ac- 
tivating the initial state and tagging it with the 
current input string position. 

2. All epsilon transitions from the active states (e.g. 
the initial ancillary state) are iteratively applied 
and the target states of those transitions are 
tagged. 

3. The input symbol is consumed and, from all the 
transitions from the active states, only those that 
match the current symbol are traversed. 

4. The traversed transitions are applied and the target 
states of those transitions are tagged. 

Figure 3 shows how the PFSM from Figure 1 consumes
the two symbols in the "ac" input string.

Whenever a transition is applied, in the second or
fourth step of the algorithm above, the target state is
checked. When it is final, a match has been found.
The starting position of the match is given by the final
state tag (or tags). The ending position of the match is
the current input string position. The matched regular
expression is identified by the label of the final state.
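
The four steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than a reference implementation: the names `run_pfsm`, `DELTA`, `LABELS` and `EPSILON` are hypothetical, the automaton is hand-built for "a*c", "ac" and "a(ca)*b", and final states are assumed to be reached only through symbol transitions, so matches are reported after the fourth step only.

```python
EPSILON = None  # assumed marker for epsilon transitions

# Hand-built PFSM for "a*c", "ac" and "a(ca)*b" (an assumption; the
# automaton in Figure 1 may be laid out differently).
DELTA = {
    0: {EPSILON: {1, 3, 6}},                                 # ancillary state
    1: {"a": {1}, "c": {2}},                                 # a*c
    3: {"a": {4}}, 4: {"c": {5}},                            # ac
    6: {"a": {7}}, 7: {"c": {8}, "b": {9}}, 8: {"a": {7}},   # a(ca)*b
}
LABELS = {2: "a*c", 5: "ac", 9: "a(ca)*b"}

def run_pfsm(delta, labels, q0, text):
    """Return every (label, start, end) match, with inclusive indices."""
    active = {}   # state -> set of start-position tags
    matches = []
    for pos, symbol in enumerate(text):
        # Step 1: (re)initialize, keeping the states that are already active.
        active.setdefault(q0, set()).add(pos)
        # Step 2: apply epsilon transitions until no new tag is propagated.
        pending = list(active)
        while pending:
            state = pending.pop()
            for target in delta.get(state, {}).get(EPSILON, ()):
                new_tags = active[state] - active.get(target, set())
                if new_tags:
                    active.setdefault(target, set()).update(new_tags)
                    pending.append(target)
        # Steps 3 and 4: traverse the transitions that match the symbol
        # and tag their target states.
        following = {}
        for state, tags in active.items():
            for target in delta.get(state, {}).get(symbol, ()):
                following.setdefault(target, set()).update(tags)
        active = following
        # A tagged final state means a match that ends at this position.
        for state, tags in active.items():
            if state in labels:
                matches.extend(
                    (labels[state], start, pos) for start in sorted(tags))
    return matches

matches = run_pfsm(DELTA, LABELS, 0, "aacacab")
for label, start, end in matches:
    print(label, "matches", f"[{start}-{end}]", "aacacab"[start:end + 1])
```

Running the sketch on "aacacab" reports the same ten matches that are listed in Figure 2, although not necessarily in the same order.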

C. Efficiency analysis 

Given an alphabet with $s$ symbols and $r$ regular
expressions, where $m$ is the number of states of an
automaton representing a regular expression, a PFSM will
contain $mr + 1$ states. A maximum of $n^2 r$ matches will
be found in an input string of $n$ symbols.

If regular expressions are implemented as DFAs, the
resulting PFSM automaton may contain up to $mr$ states
and $mrs$ transitions. At most, $r$ states may be active
when considering a single symbol of the input string. As
matches starting at different input string positions are
considered, $O(nr)$ states may be active in the PFSM at
any given time. The PFSM processes one symbol at a
time, so the process is repeated for the $n$ symbols in the
input string. Therefore, PFSMs' efficiency is $O(n^2 r)$ in
time and $O(nr + mrs)$ in space.

If regular expressions are implemented as NFAs, the
resulting PFSM automaton may contain up to $mr$ states
and $m^2 rs$ transitions. At most, $mr$ states may be active
when considering a single symbol of the input string. As
matches starting at different input string positions are
considered, $O(nmr)$ states may be active in the PFSM
at any given time. The PFSM processes one symbol at a
time, so the process is repeated for the $n$ symbols in the
input string. Therefore, PFSMs' efficiency is $O(n^2 mr)$ in
time and $O(nmr + m^2 rs)$ in space.
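
The bounds above can be written out step by step. The derivation below is a sketch that assumes the work done for the $i$-th input symbol is proportional to the number of tagged active states at that point, with at most $i$ distinct starting tags; the names $T$ and $S$ for time and space are introduced here for illustration.

```latex
\begin{align*}
% DFA case: at step i there are at most i start tags, each with at most
% r active states (one per regular expression).
T_{\mathrm{DFA}}(n) &= \sum_{i=1}^{n} O(i\,r)
                     = O\!\left(r\,\frac{n(n+1)}{2}\right) = O(n^2 r), \\
S_{\mathrm{DFA}}(n) &= O(nr) + O(mrs) = O(nr + mrs), \\
% NFA case: each regular expression may keep up to m states active per tag.
T_{\mathrm{NFA}}(n) &= \sum_{i=1}^{n} O(i\,m\,r) = O(n^2 m r), \\
S_{\mathrm{NFA}}(n) &= O(nmr) + O(m^2 rs) = O(nmr + m^2 rs).
\end{align*}
```

In both cases the first space term accounts for the tagged active states and the second term for the storage of the automaton itself.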

Our suggested implementation of PFSMs allows trad- 
ing off time for space by expressing some of the regular 
expressions as NFAs instead of DFAs. 

It should be noted that PFSMs, as described here, can
be implemented as Moore-like machines. A Mealy-like
implementation would yield no reduction in the number
of states and would decrease the efficiency of the PFSM
implementation both in terms of time and space.

(a) Initialization (input[0] is 'a').
(b) Application of epsilon transitions.
(c) Transitions matching 'a'.
(d) Application of transitions.
(e) Reinitialization (input[1] is 'c').
(f) Application of epsilon transitions.
(g) Transitions matching 'c'.
(h) Application of transitions. Three matches ending at
index 1 are found.

Figure 3. Example PFSM running for two full cycles. The final state labels specify which regular expression that state
corresponds to. Active states and transitions are shown in bold. The numbers between parentheses are the valid starting
positions for the states.

D. Parallelization 

The PFSM is a parallelizable automaton with almost
linear scalability.

1. Regular Expression Partitioning 

PFSMs can be partitioned by distributing the set of
regular expressions among the different processors. Each
processor will find the matches within a subset of regular
expressions, which reduces the automaton size at a given
processor. By following this approach, the memory use at
a given processor can be reduced down to the space that
is necessary to store a single regular expression. Although
the idea of PFSM partitioning is trivial, it is the fact that
PFSMs do not compact the regular expression set that
allows this kind of parallelization on the fly, in contrast
to existing techniques that need to distribute the regular
expression set and compact each subset separately.

Furthermore, when some regular expressions are
expressed as NFAs in order to save memory, it is possible to
distribute both NFA and DFA regular expressions among
processors in such a way that memory use and processing
time are balanced in each processor, as proposed in [18].
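
Regular expression partitioning can be sketched as follows. This is an illustrative simplification: threads stand in for processors, the regular expressions are assigned round-robin rather than balanced, each automaton is a hand-built DFA, and the names `AUTOMATA`, `run_subset` and `partitioned_run` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hand-built DFAs, one per regular expression (an assumption):
# label -> (start state, {(state, symbol): target}, final states).
AUTOMATA = {
    "a*c": (0, {(0, "a"): 0, (0, "c"): 1}, {1}),
    "ac": (0, {(0, "a"): 1, (1, "c"): 2}, {2}),
    "a(ca)*b": (0, {(0, "a"): 1, (1, "c"): 2, (1, "b"): 3, (2, "a"): 1}, {3}),
}

def run_subset(labels, text):
    """Find all matches of the given regular expressions in text."""
    active = set()    # (label, state, start tag) triples
    matches = set()   # (label, start, end) with inclusive indices
    for pos, symbol in enumerate(text):
        # Reinitialize: activate every start state, tagged with pos.
        for label in labels:
            active.add((label, AUTOMATA[label][0], pos))
        following = set()
        for label, state, tag in active:
            _, delta, finals = AUTOMATA[label]
            target = delta.get((state, symbol))
            if target is not None:
                following.add((label, target, tag))
                if target in finals:
                    matches.add((label, tag, pos))
        active = following
    return matches

def partitioned_run(text, workers):
    """Give each worker a disjoint subset of the regular expressions."""
    labels = sorted(AUTOMATA)
    subsets = [labels[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_subset, subsets, [text] * workers))
    return set().union(*results)

print(sorted(partitioned_run("aacacab", 3)))
```

Each worker only touches the automata of its own subset, and the union of the partial results equals the result of a single run over all the regular expressions.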

2. Data Partitioning 

PFSMs can be partitioned by distributing the input
string among the different processors. Two different
approaches are proposed:

• Lazy data partitioning. The input string is dis- 
tributed among the different processors. Each pro- 
cessor will find the matches that start within its 
segment of the input string and end either in or af- 
ter that segment. To achieve this, the automaton 
reinitialization is disabled after surpassing the end- 
ing position of the segment, but the processing does 
not stop until the end of the input string is reached 
or there are no active states. Each processor may 
need to ask other processors for their segment in 
order to finish its processing. 

• Chained data partitioning. The input string is 
distributed among the different processors. Each 
processor will find the matches that start and end 
within its segment. The tagged active states at the 
end of the processing are sent to the next processor, 
a different set of tagged active states are received 
from the previous processor and the whole cycle is 
repeated (without reinitializations) until no more 
tagged active states are received from the previous 
processor. 
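
Lazy data partitioning can be sketched as shown below. The sketch makes several assumptions: the segments are processed one after another instead of on separate processors, every call has access to the whole input string (so no segment has to be requested from another processor), and the names `AUTOMATA`, `run_segment` and `lazy_partitioned_run` are hypothetical.

```python
# Hand-built DFAs, one per regular expression (an assumption):
# label -> (start state, {(state, symbol): target}, final states).
AUTOMATA = {
    "a*c": (0, {(0, "a"): 0, (0, "c"): 1}, {1}),
    "ac": (0, {(0, "a"): 1, (1, "c"): 2}, {2}),
    "a(ca)*b": (0, {(0, "a"): 1, (1, "c"): 2, (1, "b"): 3, (2, "a"): 1}, {3}),
}

def run_segment(text, lo, hi):
    """Find the matches that start in text[lo:hi], wherever they end."""
    active = set()    # (label, state, start tag) triples
    matches = set()   # (label, start, end) with inclusive indices
    pos = lo
    # Keep going past the segment end while partial matches are alive.
    while pos < len(text) and (pos < hi or active):
        if pos < hi:
            # Reinitialization is only enabled inside the segment.
            for label, (start, _, _) in AUTOMATA.items():
                active.add((label, start, pos))
        following = set()
        for label, state, tag in active:
            _, delta, finals = AUTOMATA[label]
            target = delta.get((state, text[pos]))
            if target is not None:
                following.add((label, target, tag))
                if target in finals:
                    matches.add((label, tag, pos))
        active = following
        pos += 1
    return matches

def lazy_partitioned_run(text, processors):
    """Split the input into segments and merge the per-segment matches."""
    size = max(1, -(-len(text) // processors))  # ceiling division
    segments = [(lo, min(lo + size, len(text)))
                for lo in range(0, len(text), size)]
    # Each call is independent and could run on a different processor.
    return set().union(*(run_segment(text, lo, hi) for lo, hi in segments))

print(sorted(lazy_partitioned_run("aacacab", 3)))
```

With three segments over "aacacab", the first segment still reports the "a(ca)*b" match at [1-6], because processing continues past the segment boundary until no active states remain.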



Data partitioning linearly reduces memory use with re- 
spect to the number of processors. Assuming a maximum 
token length of twice the segment length, data partition- 
ing also reduces processing time linearly with respect to 
the number of processors. 



IV. CONCLUSIONS AND FUTURE WORK 

Regular expressions provide a flexible means for 
matching strings and they are often used in data- 
intensive applications. 

The implementation of regular expression matching 
typically relies on deterministic finite automata (DFAs) 
or nondeterministic finite automata (NFAs). Both DFAs
and NFAs are affected by amnesia and acalculia, and 
DFAs also suffer from insomnia. 

Techniques exist that solve those problems by con- 
verting sets of regular expressions into compact state 
machines. This approach, however, presents two ma- 
jor drawbacks: it prevents the use of existing automata 
that could be already available in antivirus signature 
databases or complex data filters; and it hinders the 
maintenance of the resulting automata, since the whole 
set has to be converted and compacted again whenever a 
regular expression has to be added or removed from the 
set. 

We have proposed Parallel Finite State Machines (PF- 
SMs), which allow multiple active states and efficiently 
find all the matches of a set of regular expressions in an 
input string, solving amnesia and acalculia. PFSMs also 
mitigate the effect of insomnia by reducing the number 
of states in the resulting automaton. 

As PFSMs do not require any conversion or com- 
paction step, they are able to run on existing DFA- or 
NFA-like machines, and they allow the addition or re- 
moval of regular expressions with zero downtime, making 
the maintenance of the automaton easier. 

PFSMs can perform regular expression matching in 
quadratic time and have linear memory space require- 
ments, apart from automata storage, in the worst case. 
On top of that, PFSMs support three different
approaches for parallelization with almost linear scalability
in practice.

Therefore, PFSMs make very fast distributed regular 
expression matching in data-intensive applications feasi- 
ble. 

We plan to apply PFSMs to cure Lamb [16] of its
partial acalculia. We also plan to apply PFSMs to other
data-intensive applications. 